Table of Contents

Loading and preparing data

Loading original participant data (NZ)

Melt dataframe to long format

Loading replication participant data (US)

Loading 2nd replication data (US, with reading measures)

Compute reading subscales

Retrieving vectors for words and dimension word pairs

Ranking color-semantic associations in word embeddings

Using single dimension words

Using dimension axes (word pair contrasts), with nearest neighbor (cosine) method

Creating datasets for statistical models

Merging data and predictors

Merge data

Add word frequency (Van Paridon & Thompson, 2020)

Add concreteness (Brysbaert et al., 2014)

Add Small World of Words associations (De Deyne et al., 2018)

(It looks like there are very few responses from NZ, but a few more from the US and elsewhere.)

Add cosine distances (Mikolov et al., 2013)

Common Crawl

Subtitles

COCA

Filtered COCA

COCA embeddings, but trained on COCA corpora from which sentences with first-order co-occurrences (sentences containing both a color word and a dimension word) have been removed.
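A minimal sketch of this first-order co-occurrence filter: drop any sentence that contains both a color word and a dimension word. The word lists and sentences below are illustrative stand-ins, not the ones used in the notebook.

```python
# Illustrative (not actual) word lists for the filter.
color_words = {"white", "yellow", "red", "blue"}
dimension_words = {"cold", "warm", "like", "dislike"}

def keep_sentence(sentence: str) -> bool:
    tokens = set(sentence.lower().split())
    has_color = bool(tokens & color_words)
    has_dimension = bool(tokens & dimension_words)
    # Keep the sentence unless it pairs a color word with a dimension word.
    return not (has_color and has_dimension)

corpus = [
    "the snow was white and cold",   # dropped: color + dimension word
    "the sky was blue",              # kept: color word only
    "we like long walks",            # kept: dimension word only
]
filtered = [s for s in corpus if keep_sentence(s)]
```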

No neighbors COCA

COCA embeddings, but trained on corpora from which the 100 nearest neighbors of each dimension word have been removed (in an attempt to disrupt the "scaffolding" on which semantic associations with the dimension words are built).
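The neighbor lookup behind this filter can be sketched with a toy cosine nearest-neighbor search; the vocabulary and random vectors here are stand-ins for real COCA embeddings, and in the notebook the top 100 neighbors of each dimension word would then be stripped from the training corpus.

```python
import numpy as np

# Toy vocabulary with random stand-in vectors (not real embeddings).
rng = np.random.default_rng(0)
vocab = ["cold", "chilly", "frosty", "banana", "sofa"]
vectors = {w: rng.normal(size=50) for w in vocab}

def nearest_neighbors(word: str, k: int = 2) -> list:
    """Return the k words most cosine-similar to `word` (excluding itself)."""
    v = vectors[word]
    sims = {
        w: float(np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u)))
        for w, u in vectors.items()
        if w != word
    }
    return sorted(sims, key=sims.get, reverse=True)[:k]

neighbors = nearest_neighbors("cold", k=2)
```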

No names COCA

COCA embeddings, but trained on corpora from which the labels generated by at least two participants for color-semantic associations (e.g. the label snow for the combination of white and cold) have been removed. (These nameability data are explored in more detail in a section at the end of this notebook.)

Correlations between predictors

Correlations between predictors in original data (NZ)

Correlations between predictors in 2nd replication data (US)

Correlations between predictors in full dataset

Standardize predictors and write to file

Nameability of color-dimension associations

Exporting names generated by participants for use in training corpus filtering

Correlating COCA-fiction cosine similarities to nameability measures

Since we only have nameability for colors and dimension axis poles (i.e. for yellow and dislike, but not for yellow and the axis dislike-like), we correlate the nameability measures with the cosine similarity between each color and each dimension axis pole.
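The correlation itself is straightforward; a sketch with made-up numbers (one value per color/pole pair, not the notebook's data):

```python
import numpy as np

# Illustrative values: cosine similarity between a color and a pole,
# and a nameability measure (e.g. Simpson diversity) for the same pair.
cosine_sim = [0.12, 0.30, 0.08, 0.25, 0.18]
simpson_diversity = [0.40, 0.65, 0.35, 0.55, 0.50]

# Pearson correlation between the two measures.
r = float(np.corrcoef(cosine_sim, simpson_diversity)[0, 1])
```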

Correlating group-averaged human ratings to nameability measure differentials

Since we do not have human ratings for the association between colors and dimension axis poles (only for association between colors and dimension axes), we need to collapse our nameability measures for the two poles of each dimension axis. One way to do this is to compute difference scores.
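A minimal sketch of the difference-score approach, with hypothetical column names and made-up values:

```python
import pandas as pd

# Hypothetical nameability scores for the two poles of one dimension axis.
poles = pd.DataFrame({
    "color": ["yellow", "blue"],
    "name_agreement_right": [0.8, 0.3],  # e.g. the "like" pole
    "name_agreement_left": [0.2, 0.5],   # e.g. the "dislike" pole
})

# Collapse the two pole scores into one difference score per color/axis,
# which can then be correlated with the color-to-axis human ratings.
poles["name_agreement_diff"] = (
    poles["name_agreement_right"] - poles["name_agreement_left"]
)
```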

Correlation of group-averaged split-inverse ratings with nameability measures

One other way to work around the issue of having only color to dimension axis pole nameability is to split and invert the human ratings of color-dimension axis associations to create two scores per rating: One for the right end of the axis (equal to the rating), and one for the left end of the axis (equal to eight minus the rating). For example: If yellow is assigned a 6 on the scale dislike-like, the rating for yellow/like is 6, but we also create a rating of 2 for yellow/dislike.
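The split-and-invert step above is simple arithmetic; a sketch, assuming ratings on a 1-7 scale (the function and key names are illustrative):

```python
def split_invert(rating: int) -> dict:
    """Split a 1-7 axis rating into two pole scores:
    the right pole keeps the rating, the left pole gets 8 minus it."""
    return {"right_pole": rating, "left_pole": 8 - rating}

# yellow rated 6 on dislike-like -> yellow/like = 6, yellow/dislike = 2
scores = split_invert(6)
```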

In short: nameability (measured as Simpson diversity and name agreement for the modal name) is weakly correlated with cosine similarity between colors and dimension axis poles, but not with human ratings, regardless of whether we fit the nameability to the ratings (by computing difference scores for the nameability measures) or fit the ratings to the nameability (by computing inverse ratings for the left poles of the dimension axes).

Extracting non-color nearest neighbors for each dimension

More figures

Mean color ratings on each dimension

Scatterplot with connected points

Convert notebook to html